30 research outputs found

    Divorce Prediction with Machine Learning: Insights and LIME Interpretability

    Divorce is one of the most common social issues in developed countries such as the United States. Almost 50% of recent marriages end in an involuntary divorce or separation. While people respond to different extents, and even differently over time, and an event like divorce may not interrupt an individual's daily activities, it still has a severe effect on the individual's mental health and personal life. Within the scope of this research, divorce prediction was carried out by evaluating a dataset named the 'divorce predictor dataset' to correctly classify married and divorced people using six different machine learning algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Gaussian Naïve Bayes (NB), and Support Vector Machines (SVM). Preliminary computational results show that algorithms such as SVM, KNN, and LDA can perform that task with an accuracy of 98.57%. An additional novel contribution of this work is the detailed and comprehensive explanation of prediction probabilities using Local Interpretable Model-Agnostic Explanations (LIME). Utilizing LIME to analyze test results illustrates the possibility of differentiating between divorced and married couples. Finally, we have developed a divorce predictor app that considers the ten most important features that potentially affect couples' decisions about divorce; such tools can be used by anyone to assess the condition of their relationship.

    Data balancing approaches in quality, defect, and pattern analysis

    The imbalanced ratio of data is one of the most significant challenges in various industrial domains. Consequently, numerous data-balancing approaches have been proposed over the years. However, most of these data-balancing methods come with their own limitations that can potentially impact data-driven decision-making models in critical sectors such as product quality assurance, manufacturing defect identification, and pattern recognition in healthcare diagnostics. This dissertation addresses three research questions related to data-balancing approaches: 1) What are the scopes of data-balancing approaches toward the major and minor samples? 2) What is the effect of traditional Machine Learning (ML) and Synthetic Minority Over-sampling Technique (SMOTE)-based data-balancing on imbalanced data analysis? and 3) How does imbalanced data affect the performance of Deep Learning (DL)-based models? To achieve these objectives, this dissertation thoroughly analyzes existing reference works and identifies their limitations. It has been observed that most existing data-balancing approaches have several limitations, such as creating noise during oversampling, removing important information during undersampling, and being unable to perform well with multidimensional data. Furthermore, it has also been observed that SMOTE-based approaches have been the most widely used data-balancing approaches as they can create synthetic samples that are easy to implement compared to other existing techniques. However, SMOTE also has its limitations, and therefore, it is required to identify whether there is any significant effect of SMOTE-based oversampled approaches on ML-based data-driven models' performance. To do that, the study conducts several hypothesis tests considering several popular ML algorithms with and without hyperparameter settings. 
Based on the overall hypothesis, it is found that, in many cases based on the reference dataset, there is no significant performance improvement on data-driven ML models once the imbalanced data is balanced using SMOTE approaches. Additionally, the study finds that SMOTE-based synthetic samples often do not follow the Gaussian distribution or do not follow the same distribution of the data as the original dataset. Therefore, the study suggests that Generative Adversarial Network (GAN)-based approaches could be a better alternative to develop more realistic samples and might overcome the limitations of SMOTE-based data-balancing approaches. However, GAN is often difficult to train, and very limited studies demonstrate the promising outcome of GAN-based tabular data balancing as GAN is mainly developed for image data generation. Additionally, GAN is hard to train as it is computationally not efficient. To overcome such limitations, the present study proposes several data-balancing approaches such as GAN-based oversampling (GBO), Support Vector Machine (SVM)-SMOTE-GAN (SSG), and Borderline-SMOTE-GAN (BSGAN). The proposed approaches outperform existing SMOTE-based data-balancing approaches in various highly imbalanced tabular datasets and can produce realistic samples. Additionally, the oversampled data follows the distribution of the original dataset. The dissertation later examines two case scenarios where data-balancing approaches can play crucial roles, specifically in healthcare diagnostics and additive manufacturing. The study considers several Chest radiography (X-ray) and Computed Tomography (CT)-scan image datasets for the healthcare diagnostics scenario to detect patients with COVID-19 symptoms. The study employs six different Transfer Learning (TL) approaches, namely Visual Geometry Group (VGG)16, Residual Network (ResNet)50, ResNet101, Inception-ResNet Version 2 (InceptionResNetV2), Mobile Network version 2 (MobileNetV2), and VGG19. 
Based on the overall analysis, it has been observed that, except for the ResNet-based model, most of the TL models have been able to detect patients with COVID-19 symptoms with an accuracy of almost 99%. However, one potential drawback of TL approaches is that the models have been learning from the wrong regions. For example, instead of focusing on the infected lung regions, the TL-based models have been focusing on the non-infected regions. To address this issue, the study has updated the TL-based models to reduce the models' wrong localization. Similarly, the study conducts an additional investigation on an imbalanced dataset containing defect and non-defect images of 3D-printed cylinders. The results show that TL-based models are unable to locate the defect regions, highlighting the challenge of detecting defects using imbalanced data. To address this limitation, the study proposes preprocessing-based approaches, including algorithms such as Region of Interest Net (ROIN), Region of Interest and Histogram Equalizer Net (ROIHEN), and Region of Interest with Histogram Equalization and Details Enhancer Net (ROIHEDEN) to improve the model's performance and accurately identify the defect region. Furthermore, this dissertation employs various model interpretation techniques, such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Gradient-weighted Class Activation Mapping (Grad-CAM), to gain insights into the features in numerical, categorical, and image data that characterize the models' predictions. These techniques are used across multiple experiments and significantly contribute to a better understanding of the models' decision-making processes. Lastly, the study considers a small mixed dataset containing numerical, categorical, and image data. Such diverse data types are often challenging for developing data-driven ML models. 
The study proposes a computationally efficient and simple ML model to address these data types by leveraging the Multilayer Perceptron and Convolutional Neural Network (MLP-CNN). The proposed MLP-CNN models demonstrate superior accuracy in identifying COVID-19 patients' patterns compared to existing methods. In conclusion, this research proposes various approaches to tackle significant challenges associated with class imbalance problems, including the sensitivity of ML models to multidimensional imbalanced data, distribution issues arising from data expansion techniques, and the need for model explainability and interpretability. By addressing these issues, this study can potentially mitigate data balancing challenges across various industries, particularly those that involve quality, defect, and pattern analysis, such as healthcare diagnostics, additive manufacturing, and product quality. By providing valuable insights into the models' decision-making process, this research could pave the way for developing more accurate and robust ML models, thereby improving their performance in real-world applications

    Invariant Scattering Transform for Medical Imaging

    Over the years, the Invariant Scattering Transform (IST) technique has become popular for medical image analysis, including the use of wavelet transform computation with Convolutional Neural Networks (CNN) to capture the scale and orientation of patterns in the input signal. IST aims to be invariant to transformations that are common in medical images, such as translation, rotation, scaling, and deformation. It is used to improve performance in medical imaging applications such as segmentation, classification, and registration, and it can be integrated into machine learning algorithms for disease detection, diagnosis, and treatment planning. Additionally, combining IST with deep learning approaches has the potential to leverage their strengths and enhance medical image analysis outcomes. This study provides an overview of IST in medical imaging by considering the types of IST, their applications, limitations, and potential scope for future researchers and practitioners.

    Defect Analysis of 3D Printed Cylinder Object Using Transfer Learning Approaches

    Additive manufacturing (AM) is gaining attention across various industries like healthcare, aerospace, and automotive. However, identifying defects early in the AM process, which can reduce production costs and improve productivity, remains a key challenge. This study explored the effectiveness of machine learning (ML) approaches, specifically transfer learning (TL) models, for defect detection in 3D-printed cylinders. Images of cylinders were analyzed using models including VGG16, VGG19, ResNet50, ResNet101, InceptionResNetV2, and MobileNetV2. Performance was compared across two datasets using accuracy, precision, recall, and F1-score metrics. In the first study, VGG16, InceptionResNetV2, and MobileNetV2 achieved perfect scores. In contrast, ResNet50 had the lowest performance, with an average F1-score of 0.32. Similarly, in the second study, MobileNetV2 correctly classified all instances, while ResNet50 struggled with more false positives and fewer true positives, resulting in an F1-score of 0.75. Overall, the findings suggest certain TL models like MobileNetV2 can deliver high accuracy for AM defect classification, although performance varies across algorithms. The results provide insights into model optimization and integration needs for reliable automated defect analysis during 3D printing. By identifying the top-performing TL techniques, this study aims to enhance AM product quality through robust image-based monitoring and inspection.
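
The four metrics named in this abstract all derive from confusion-matrix counts. The sketch below shows the standard definitions; the counts are made up for illustration and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts (assumption, not the paper's data):
# a classifier with no errors, and one with extra false positives
# and missed defects.
perfect = classification_metrics(tp=50, fp=0, fn=0, tn=50)
weaker = classification_metrics(tp=30, fp=10, fn=10, tn=50)
```

More false positives lower precision and fewer true positives lower recall, so both kinds of error pull the F1-score down.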

    Left and Right Hand Movements EEG Signals Classification Using Wavelet Transform and Probabilistic Neural Network

    Electroencephalogram (EEG) signals have great importance in the area of brain-computer interface (BCI), which has diverse applications ranging from medicine to entertainment. BCI acquires brain signals, extracts informative features, and generates control signals from the knowledge of these features for the functioning of external devices. The objective of this work is twofold: first, to extract suitable features related to hand movements, and second, to find an effective classifier that discriminates left and right hand movement signals. This work is a continuation of our previous study, where the beta band was found to be compatible with hand movement analysis. The discrete wavelet transform (DWT) has been used to separate the beta band of the EEG signal in order to extract features. The performance of a probabilistic neural network (PNN) is investigated to find a better classifier of left and right hand movement EEG signals and is compared with a classical back propagation (BP) based neural network. The obtained results show that the PNN (99.1%) has a better classification rate than the BP network (88.9%). The results of this study are expected to be helpful in brain-computer interfacing for hand movement related bio-rehabilitation applications.
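
A probabilistic neural network is essentially a Parzen-window classifier: each training sample contributes a Gaussian kernel, the kernels are averaged per class, and the class with the largest average wins. The sketch below illustrates that idea only; the two-dimensional feature vectors, the labels, and the smoothing width `sigma` are hypothetical and are not the paper's features or settings.

```python
import math


def pnn_predict(train_x, train_y, x, sigma=0.5):
    """Classify x with a minimal probabilistic neural network."""
    kernel_sums = {}
    class_counts = {}
    for xi, yi in zip(train_x, train_y):
        # Pattern layer: Gaussian kernel centred on each training sample.
        sq_dist = sum((a - b) ** 2 for a, b in zip(xi, x))
        kernel = math.exp(-sq_dist / (2 * sigma ** 2))
        kernel_sums[yi] = kernel_sums.get(yi, 0.0) + kernel
        class_counts[yi] = class_counts.get(yi, 0) + 1
    # Summation layer: average kernel response per class.
    averages = {c: kernel_sums[c] / class_counts[c] for c in kernel_sums}
    # Decision layer: pick the class with the highest estimated density.
    return max(averages, key=averages.get)


# Hypothetical feature vectors (assumption), e.g. two band-energy features.
train_x = [(0.1, 0.2), (0.2, 0.1), (0.9, 1.0), (1.0, 0.8)]
train_y = ["left", "left", "right", "right"]
prediction = pnn_predict(train_x, train_y, (0.15, 0.15))
```

Unlike a back propagation network, this classifier has no iterative training step; the only tunable quantity in the sketch is `sigma`.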

    Study of Different Deep Learning Approach with Explainable AI for Screening Patients with COVID-19 Symptoms: Using CT Scan and Chest X-ray Image Dataset

    The outbreak of COVID-19 disease caused more than 100,000 deaths so far in the USA alone. It is necessary to conduct an initial screening of patients with the symptoms of COVID-19 disease to control the spread of the disease. However, it is becoming laborious to conduct the tests with the available testing kits due to the growing number of patients. Some studies proposed CT scan or chest X-ray images as an alternative solution. Therefore, it is essential to use every available resource, instead of either a CT scan or chest X-ray, to conduct a large number of tests simultaneously. As a result, this study aims to develop a deep learning-based model that can detect COVID-19 patients with better accuracy on both CT scan and chest X-ray image datasets. In this work, eight different deep learning approaches, namely VGG16, InceptionResNetV2, ResNet50, DenseNet201, VGG19, MobilenetV2, NasNetMobile, and ResNet15V2, have been tested on two datasets: one includes 400 CT scan images, and the other includes 400 chest X-ray images. Besides, Local Interpretable Model-agnostic Explanations (LIME) is used to explain the model's interpretability. Using LIME, test results demonstrate that it is conceivable to interpret top features that should have worked to build a trust AI framework to distinguish between patients with COVID-19 symptoms and other patients. Comment: This is a work in progress, it should not be relied upon without context to guide clinical practice or health-related behavior and should not be reported in news media as established information without consulting multiple experts in the fiel

    Effect of Data Scaling Methods on Machine Learning Algorithms and Model Performance

    Heart disease, one of the main reasons behind the high mortality rate around the world, requires a sophisticated and expensive diagnosis process. In the recent past, much literature has demonstrated machine learning approaches as an opportunity to efficiently diagnose heart disease patients. However, challenges associated with datasets such as missing data, inconsistent data, and mixed data (containing inconsistent missing data both as numerical and categorical) are often obstacles in medical diagnosis. This inconsistency leads to a higher probability of misprediction and misleading results. Data preprocessing steps like feature reduction, data conversion, and data scaling are employed to form a standard dataset; such measures play a crucial role in reducing inaccuracy in the final prediction. This paper aims to evaluate eleven machine learning (ML) algorithms, namely Logistic Regression (LR), Linear Discriminant Analysis (LDA), K-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Naive Bayes (NB), Support Vector Machine (SVM), XGBoost (XGB), Random Forest Classifier (RF), Gradient Boost (GB), AdaBoost (AB), and Extra Tree Classifier (ET), and six different data scaling methods, namely Normalization (NR), Standscale (SS), MinMax (MM), MaxAbs (MA), Robust Scaler (RS), and Quantile Transformer (QT), on a dataset comprising information on patients with heart disease. The results show that CART, along with RS or QT, outperforms all other ML algorithms with 100% accuracy, 100% precision, 99% recall, and 100% F1 score. The study outcomes demonstrate that the model's performance varies depending on the data scaling method. Open Access fees paid for in whole or in part by the University of Oklahoma Libraries.
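
To make the difference between scaling methods concrete, the sketch below implements four of the scalers named above (MinMax, standard scaling, MaxAbs, and a robust scaler based on the median and interquartile range) for a single feature column. It uses only the standard library and is an illustration, not the paper's pipeline; the `readings` values are hypothetical and include one deliberate outlier.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std else 0.0 for v in values]


def max_abs_scale(values):
    """Divide by the largest absolute value, giving the range [-1, 1]."""
    peak = max(abs(v) for v in values)
    return [v / peak if peak else 0.0 for v in values]


def robust_scale(values):
    """Centre on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr if iqr else 0.0 for v in values]


# Hypothetical feature column (assumption) with one outlier at the end.
readings = [120, 130, 140, 150, 400]
scaled_min_max = min_max_scale(readings)
scaled_robust = robust_scale(readings)
```

With min-max scaling the outlier squeezes the other four values into a narrow band near zero, while the robust scaler keeps them evenly spread and pushes only the outlier far away. Differences of this kind are one reason a model's performance can depend on the scaler.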

    BSGAN: A Novel Oversampling Technique for Imbalanced Pattern Recognitions

    Class imbalance problems (CIP) are among the potential challenges in developing unbiased Machine Learning (ML) models for predictions. CIP occurs when data samples are not equally distributed between two or more classes. Borderline-Synthetic Minority Oversampling Technique (SMOTE) is one of the approaches that has been used to balance imbalanced data by oversampling the minor (limited) samples. One potential drawback of existing Borderline-SMOTE is that it focuses on the data samples that lie at the border and gives more attention to the extreme observations, ultimately limiting the creation of more diverse data after oversampling; this is the case for most Borderline-SMOTE-based oversampling strategies. As a result, marginalization occurs after oversampling. To address these issues, in this work, we propose a hybrid oversampling technique that combines the power of Borderline-SMOTE and Generative Adversarial Networks (GAN) to generate more diverse data that follow Gaussian distributions. We named it BSGAN and tested it on four highly imbalanced datasets: Ecoli, Wine quality, Yeast, and Abalone. Our preliminary computational results reveal that BSGAN outperformed existing Borderline-SMOTE and GAN-based oversampling techniques and created a more diverse dataset that follows a normal distribution after oversampling.
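
The behaviour criticised in this abstract is easier to see in code. The sketch below is a simplified version of plain Borderline-SMOTE only (no GAN stage, so it is not BSGAN): a minority sample counts as borderline when at least half, but not all, of its `m` nearest neighbours belong to another class, and new samples are interpolated between a borderline sample and one of its `k` nearest minority neighbours. The data points, labels, and parameter values are hypothetical.

```python
import math
import random


def nearest(point, pool, k):
    """Return the k samples in pool closest to point, excluding point itself."""
    others = [p for p in pool if p is not point]
    return sorted(others, key=lambda p: math.dist(p[0], point[0]))[:k]


def borderline_smote(samples, minority_label, n_new, m=3, k=2, seed=0):
    """Generate n_new synthetic minority samples near the class border."""
    rng = random.Random(seed)
    minority = [s for s in samples if s[1] == minority_label]
    # Step 1: find borderline ("danger") minority samples.
    danger = []
    for s in minority:
        neighbours = nearest(s, samples, m)
        n_other = sum(1 for nb in neighbours if nb[1] != minority_label)
        if m / 2 <= n_other < m:
            danger.append(s)
    if not danger or len(minority) < 2:
        return []
    # Step 2: interpolate between a danger sample and a minority neighbour.
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(danger)
        partner = rng.choice(nearest(base, minority, k))
        gap = rng.random()
        features = tuple(b + gap * (p - b) for b, p in zip(base[0], partner[0]))
        synthetic.append((features, minority_label))
    return synthetic


# Hypothetical two-feature dataset (assumption): 3 minority, 5 majority samples.
data = [
    ((1.0, 1.0), "minor"), ((1.2, 0.9), "minor"), ((2.0, 2.0), "minor"),
    ((2.2, 2.1), "major"), ((2.1, 1.8), "major"), ((3.0, 3.0), "major"),
    ((3.2, 2.9), "major"), ((2.9, 3.1), "major"),
]
new_points = borderline_smote(data, "minor", n_new=4)
```

Every synthetic point here starts from the single borderline sample at (2.0, 2.0), which shows in miniature why oversampling driven only by border points can limit the diversity of the generated data.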

    Machine-Learning-Based Disease Diagnosis: A Comprehensive Review

    Globally, there is a substantial unmet need to diagnose various diseases effectively. The complexity of the different disease mechanisms and underlying symptoms of the patient population presents massive challenges in developing early diagnosis tools and effective treatments. Machine learning (ML), an area of artificial intelligence (AI), enables researchers, physicians, and patients to solve some of these issues. Based on relevant research, this review explains how ML is being used to help in the early identification of numerous diseases. Initially, a bibliometric analysis of the publications is carried out using data from the Scopus and Web of Science (WOS) databases. The bibliometric study of 1216 publications was undertaken to determine the most prolific authors, nations, organizations, and most cited articles. The review then summarizes the most recent trends and approaches in machine-learning-based disease diagnosis (MLBDD), considering the following factors: algorithm, disease types, data type, application, and evaluation metrics. Finally, in this paper, we highlight key results and provide insight into future trends and opportunities in the MLBDD area.

    A Comparative Analysis on Suicidal Ideation Detection Using NLP, Machine, and Deep Learning

    Social networks are essential resources to obtain information about people’s opinions and feelings towards various issues as they share their views with their friends and family. Suicidal ideation detection via online social network analysis has emerged as an essential research topic with significant difficulties in the fields of NLP and psychology in recent years. With the proper exploitation of the information in social media, the complicated early symptoms of suicidal ideations can be discovered and hence, it can save many lives. This study offers a comparative analysis of multiple machine learning and deep learning models to identify suicidal thoughts from the social media platform Twitter. The principal purpose of our research is to achieve better model performance than prior research works to recognize early indications with high accuracy and avoid suicide attempts. We applied text pre-processing and feature extraction approaches such as CountVectorizer and word embedding, and trained several machine learning and deep learning models for such a goal. Experiments were conducted on a dataset of 49,178 instances retrieved from live tweets by 18 suicidal and non-suicidal keywords using Python Tweepy API. Our experimental findings reveal that the RF model can achieve the highest classification score among machine learning algorithms, with an accuracy of 93% and an F1 score of 0.92. However, training the deep learning classifiers with word embedding increases the performance of ML models, where the BiLSTM model reaches an accuracy of 93.6% and a 0.93 F1 score
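
The count-based feature extraction mentioned in this abstract turns each text into a vector of token counts over a shared vocabulary. The sketch below is a minimal standard-library stand-in for that idea, not scikit-learn's `CountVectorizer` itself; the tokenizer rule and the two example sentences are assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fit_vocabulary(documents):
    """Assign a column index to every distinct token, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def transform(documents, vocabulary):
    """Convert each document into a row of token counts."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        row = [0] * len(vocabulary)
        for tok, n in counts.items():
            if tok in vocabulary:  # tokens unseen during fitting are ignored
                row[vocabulary[tok]] = n
        rows.append(row)
    return rows


# Hypothetical example texts (assumption, not from the study's dataset).
docs = ["I feel fine today", "today I feel so so tired"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
```

Rows of this kind can be fed to a classical classifier. Word embeddings, the other approach named in the abstract, replace the sparse counts with dense vectors learned per token.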